ABSTRACT 

We address the problem of fine-grained action localization 
from temporally untrimmed web videos. We assume that 
only weak video-level annotations are available for training. 
The goal is to use these weak labels to identify temporal 
segments corresponding to the actions, and learn models 
that generalize to unconstrained web videos. We find that 
web images queried by action names serve as well-localized 
highlights for many actions, but are noisily labeled. To 
solve this problem, we propose a simple yet effective method 
that takes weak video labels and noisy image labels as input, 
and generates localized action frames as output. This 
is achieved by cross-domain transfer between video frames 
and web images, using pre-trained deep convolutional neural 
networks. We then use the localized action frames to 
train action recognition models with long short-term memory 
networks. We collect a fine-grained sports action data 
set FGA-240 of more than 130,000 YouTube videos. It has 
240 fine-grained actions under 85 sports activities. Convincing 
results are shown on the FGA-240 data set, as well as 
the THUMOS 2014 localization data set with untrimmed 
training videos. 

Categories and Subject Descriptors 

I.2.10 [Vision and Scene Understanding]: Video analysis 

General Terms 

Algorithms, Experimentation, Measurement 
1. INTRODUCTION 

This paper addresses the problem of fine-grained action 
localization from unconstrained web videos. A fine-grained 
action takes place in a higher-level activity or event (e.g., 
jump shot and slam dunk in basketball, blow candle in birthday 
party). Its instances are usually temporally localized 
within the videos, and share similar context with other 
fine-grained actions belonging to the same activity or event. 

Figure 1: Fine-grained actions are usually present as 
a tiny fraction within videos (top). Our framework 
uses cross-domain transfer from possibly noisy image 
search results (bottom) and identifies the action 
related images for both domains (marked in green). 

Most existing work on action recognition focuses on action 
classification using pre-segmented short video clips [25] [14], 
which assumes implicitly that the actions of interest are 
temporally segmented during both training and testing. The 
TRECVID Multimedia Event Recounting evaluation as 
well as the THUMOS 14 Challenge both address action localization 
in untrimmed video, but the typical approach involves 
training classifiers on temporally segmented action 
clips and testing using sliding window on untrimmed video. 
This setting does not scale to large action vocabularies when 
data is collected from consumer video websites. Videos here 
are unconstrained in length, format (home videos vs. professional 
videos), and almost always only have video-level 
annotations of actions. 

We assume that only video-level annotations are available 
for the fine-grained action localization problem. The ability 
to localize fine-grained actions in videos has important 
applications such as video highlighting, summarization, and 
automatic video transcription. It is also a challenging problem 
for several reasons: first, fine-grained actions for any 
high-level activity or event are inherently similar since they 
take place in similar scene context; second, occurrences of 
the fine-grained actions are usually short (a few seconds) in 
training videos, making it difficult to associate the video-level 
labels to the occurrences. 

Our key observation is that one can exploit web images 
to help localize fine-grained actions in videos. As illustrated 
in Figure 1, by using action names (basketball slam dunk) 
as queries, many of the image search results offer well-localized 
actions, though some of them are non-video like or 
irrelevant. Identifying action related frames from weakly supervised 
videos and filtering irrelevant image tags is hard in 
either modality by itself; however, it is easier to tackle these 
two problems together. This is due to our observation that 
although most of the video frames and web images which 
correspond to actions are visually similar, the distributions 
of non-action images from the video domain and the web 
image domain are usually very different. For example, in 
a video with a basketball slam dunk, non slam dunk frames 
in the video are mostly from a basketball game. The irrelevant 
results returned by image search are more likely to be 
product shots, or cartoons. 

This motivates us to formulate a domain transfer problem 
between web images and videos. To allow domain transfer, 
we first treat the videos as a bag of frames, and use the 
feature activations from deep convolutional neural networks 
(CNN) as the common representation for images and 
frames. Suppose we have selected a set of video frames and 
a set of web images for every action; the domain transfer 
framework then goes in two directions: video frames to web 
images, and vice versa. For both directions, we use the selected 
images from the source domain to train action classifiers by 
fine-tuning the top layers of the CNN; we then apply the 
trained classifiers to the target domain. Each image in the 
target domain is assigned a confidence score given by its 
associated action classifier from the source domain. By gradually 
filtering out the images with low scores, the bidirectional 
domain transfer can progress iteratively. In practice, 
we start from the video frames to web images direction, and 
randomly select the video frames for training. Since the 
non-action related frames are not likely to occur in web images, 
the tuned CNN can be used to filter out the non-video 
like and irrelevant web images. The final domain transfer 
from web images is used to localize action related frames 
in videos. We term these action-related frames localized 
action frames (LAF). 

Videos are more than an unordered collection of frames. 
We choose long short-term memory (LSTM) networks as 
the temporal model. Compared with the traditional recurrent 
neural networks (RNN), LSTM has built-in input gates 
and forget gates to control its memory cells. These gates 
allow LSTM to either keep a long term memory or forget 
its history. The ability to learn from long sequences with 
unknown size of background is well-suited for fine-grained 
action localization from unconstrained web videos. We treat 
every sampled video frame as a time step in LSTM. When 
we train LSTM models, we label all video frames by their 
video-level annotation, but use the LAF scores generated by 
bidirectional domain transfer as weights on the loss for 
misclassification. By doing this, irrelevant frames are effectively 
down-weighted in the training stage. The framework can be 
naturally extended to use video shots as time steps, from 
which spatio-temporal features can be extracted to capture 
local motion information. 

Fine-grained action localization from untrimmed web videos 
is a new task. The closest existing data set is THUMOS 
2014 with 20 sports categories. It is designed for action localization 
using segmented videos as training, but has 1,010 
untrimmed validation videos. To evaluate the framework 
in a large scale setting, we collected a new data set from 
YouTube. We chose 240 fine-grained actions belonging to 85 
different sports activities; the total number of videos is over 
130,000. Although the evaluated categories are sports actions, 
this method can be easily extended to other domains. 
For example, one can easily get cut cake, eat cake and blow 
candle images for a birthday party event with image search. 

Our work makes three major contributions: 

• We show that learning temporally localized actions 
from videos becomes much easier if we combine weakly 
labeled video frames and noisily tagged web images. 
This is achieved by a simple yet effective domain transfer 
algorithm. 

• We propose a localization framework that uses LSTM 
network with the localized action frames to model the 
temporal evolution of actions. 

• We introduce the problem of fine-grained action localization 
with untrimmed videos, and collect a large 
fine-grained sports action data set with over 130,000 
videos in 240 categories. The data set is available 
online.¹ 

2. RELATED WORK 

Most existing work on activity recognition focuses on 
classifying pre-segmented clips. For example, the UCF 101 data 
set and the HMDB 51 data set provide 101 and 51 
activity categories respectively. Activity types range from 
primitive human actions, sports to playing instruments; the 
typical length of each video clip is 5 to 10 seconds. More 
recently, Karpathy et al. proposed the Sports-1M data 
set with more than 1 million untrimmed YouTube videos. 
Even though it offers 487 sports categories, most of them are 
high-level activities such as basketball and cricket. For 
fine-grained action recognition, Rohrbach et al. collected a 
cooking action data set with temporal annotation; the videos 
were shot in an indoor kitchen with static camera. To the 
best of our knowledge, there is no previous work on 
fine-grained action localization with untrimmed training videos. 

Action recognition typically involves two basic steps: feature 
extraction and classifier training. The standard approach 
is to extract hand-designed low-level features, and 
then aggregate the features into fixed-length feature vectors 
for classification. Oneata et al. [16] showed that a 
combination of visual features (SIFT [15]) and motion features 
(DT) represented using Fisher Vectors produced 
state-of-the-art activity and event classification performance. 

Recent approaches, particularly those based on deep neural 
networks, jointly learn features and classifiers. Karpathy 
et al. proposed several variations of convolutional neural 
net (CNN) architectures that extended Krizhevsky et al.’s 
image classification model and attempted to learn motion 
patterns from spatio-temporal video patches. Simonyan 
and Zisserman obtained good results on action recognition 
using a two-stream CNN that takes pre-computed optical 
flow as well as raw frame-level pixels as input. 

¹ https://sites.google.com/site/finegrainedactions/ 

Figure 2: Illustration of the LAF proposal framework for basketball slam dunk videos. We use a cross-domain 
transfer algorithm to jointly filter out non-video like web images and non-action like video frames. We then 
learn a LAF proposal model using the filtered images, which assigns LAF scores to training video frames. 
Finally, we train LSTM based fine-grained action detectors, where the misclassification penalty of each time 
step is weighted by the LAF score. 

There have been many attempts to address the action localization 
problem. Tian et al. proposed a temporal 
extension of the deformable part model (DPM) for action 
detection; they used spatio-temporal bounding boxes for 
training. Wang et al. used dynamic poselets which took 
motion and pose into account. Jain et al. localized the 
actions with tubelets. All these approaches require manually 
annotated spatio-temporal bounding boxes for training. 
For temporal localization, the THUMOS 2014 data set provides 
trimmed video segments to train action classifiers; the 
classifiers can be used for localization with temporal sliding 
windows. 

The idea of using images as auxiliary data has also recently 
been explored. The most common usage is to learn 
mid-level concept detectors for high-level event classification. 
Yang et al. propose a domain adaptation 
algorithm from images to videos, but assume perfect 
image annotation. Divvala et al. learn mixtures of object 
sub-categories using web images; they achieve this goal by 
filtering sub-categories with low classification performance. 

The long short-term memory network (LSTM) was proposed 
by Hochreiter and Schmidhuber as an improvement 
over traditional recurrent neural networks (RNN) for 
classification and prediction of time series data. Specifically, 
an LSTM can remember and forget values from the past, 
unlike a regular RNN where error gradients decay exponentially 
quickly with the time lag between events. It has recently 
shown excellent performance in modeling sequential 
data such as speech recognition [22] [5], handwriting recognition 
and machine translation [28]. More recently, LSTM 
has also been applied to video-level action classification 
and to generating image descriptions. 

3. OUR APPROACH 

Our proposed fine-grained action localization framework 
uses both weakly labeled videos and noisily tagged web images. 
It employs the same CNN based representation for 
web images and video frames, and uses a bidirectional 
domain transfer algorithm to filter out irrelevant images in 
both domains. A localized action frame (LAF) proposal 
model is trained from the remaining web images, and used 
to assign LAF scores to video frames. Finally, we use long 
short-term memory networks to train fine-grained action 
detectors, using the LAF scores as the weight of loss for 
misclassification. The pipeline is illustrated in Figure 2. 

3.1 Shared CNN Representation 

A shared feature space is required for domain transfer 
between images and videos. Here we treat a video as a 
bag of frames, and extract activations from the intermediate 
layers of a convolutional neural network (CNN) as features 
for both web images and video frames. Although there is 
previous work on action recognition from still images using 
other representations, we choose CNN activations for their 
simplicity and state-of-the-art performance in several action 
recognition tasks. 

Training a CNN end-to-end from scratch is time consuming, 
and requires a large amount of annotated data. It has 
been shown that CNN weights trained from large image data 
sets like ImageNet are generic, and can be applied to 
other image classification tasks by fine-tuning. It is also 
possible to disable the error back-propagation for the first 
several layers during fine-tuning. This is equivalent to training 
a shallower neural network using the intermediate CNN 
activations as features. 

In this paper, we adopt the methodology of fine-tuning the 
top layers of CNN, and experiment with the AlexNet 
CNN architecture. It contains five convolution layers and 
three fully connected layers. Each convolution layer is followed 
by a ReLU non-linearity layer and a maximum pooling 
layer. We pre-trained the network on the ImageNet data set 
using previously defined data partitions. We resized the images 
to 256 by 256, and used the raw pixels as inputs. For the 
purpose of fine-tuning, we fixed the network weights before 
its first fully connected layer and only updated the parameters 
of the top three layers. Feature activations from fc6 
serve as the shared representation for web images and video 
frames, and allow cross-domain transfer between the two. 

3.2 LAF Proposal with Web Images 

Fine-grained actions tend to be more localized in videos 
than high-level activities. For example, a basketball match 
video usually consists of jump shot, slam dunk, free throw 
etc., each of which may be as short as a few seconds. We 
address the problem of automatically identifying them from 
minutes-long videos. 

Figure 3: Top retrieved images from the Google image 
search engine, using keywords tennis serve and 
baseball pitch. 

Fortunately, we observe that many of the fine-grained actions 
have image highlights on the Internet (Figure 3). They 
are easily obtained by querying image search engines with 
action names. However, these images are noisily labeled, 
and not useful for learning LAF proposal models directly, as 
they contain: 

• Irrelevant images due to image crawling error, for example, 
a jogging image could be retrieved with the 
keyword soccer dribbling. 

• Items related to the actions, such as objects and logos. 

• Images with the same action but from a different domain, 
such as advertisement images with clear background, 
or cartoons. 

Algorithm 1: Domain transfer algorithm for localized 
action frame proposal. 

Input : Images with noisy labels $(I_i, a_i)$, frames with 
        video-level labels $(V_i, a_i)$ 
Output: LAF proposal model $\mathrm{CNN}_i$ 

Initialize $\mathcal{X}$ and $\mathcal{V}$ to include all image and frame inputs 
respectively. 

while stopping criteria not met do 
    1. Fine-tune $\mathrm{CNN}_v$ using data in frame set $\mathcal{V}$. 
    2. Compute $\mathrm{CNN}_v(I)$ for all $I \in \mathcal{X}$. 
    3. Update $\mathcal{X} = \{I \mid I \in \mathcal{X},\ \mathrm{CNN}_v(I)_{a_I} > \theta_1\}$. 
    4. Fine-tune $\mathrm{CNN}_i$ using data in image set $\mathcal{X}$. 
    5. Compute $\mathrm{CNN}_i(V)$ for all $V \in \mathcal{V}$. 
    6. Update $\mathcal{V} = \{V \mid V \in \mathcal{V},\ \mathrm{CNN}_i(V)_{a_V} > \theta_2\}$. 
end 

return $\mathrm{CNN}_i$ 

Filtering the irrelevant web images is a challenging problem 
by itself. However, it can be turned into an easier problem 
by using weakly-supervised videos. We hypothesize that 
applying a classifier, learned on video frames, as a filter on 
the images removes many irrelevant images and preserves 
most video-like image highlights. More formally, assume we 
have video frames in $\mathcal{V}$ and web images in $\mathcal{X}$, and each of 
them is assigned a fine-grained action label $a = 0, 1, \ldots, N-1$. 
We first learn a multi-class classifier $\mathrm{CNN}_v(\cdot) \in \mathbb{R}^N$ by 
fine-tuning the top layers of CNN using video frames. $\mathrm{CNN}_v(\cdot)$ 
encodes action discriminative information from the videos’ 
perspective; we apply it to all $I \in \mathcal{X}$, and update 

$$\mathcal{X} = \{ I \mid I \in \mathcal{X},\ \mathrm{CNN}_v(I)_{a_I} > \theta_1 \} \tag{1}$$

where $\theta_1 \in [0, 1]$ is the threshold for minimum softmax output, 
and $\mathrm{CNN}_v(I)_{a_I}$ corresponds to the $a_I$-th dimension of 
$\mathrm{CNN}_v(I)$. 

We then use the filtered $\mathcal{X}$ to fine-tune $\mathrm{CNN}_i(\cdot) \in \mathbb{R}^N$ 
and update $\mathcal{V}$ in a similar manner: 

$$\mathcal{V} = \{ V \mid V \in \mathcal{V},\ \mathrm{CNN}_i(V)_{a_V} > \theta_2 \} \tag{2}$$

We iterate the process and update $\mathcal{V}$ and $\mathcal{X}$ until certain 
stopping criteria are met. The LAF proposal model $\mathrm{CNN}_i(\cdot)$ 
is learned using the final web image set $\mathcal{X}$; the LAF score for 
a video frame $V$ with action label $a$ is given by 

$$\mathrm{LAF}(V) = \mathrm{CNN}_i(V)_a \tag{3}$$

The whole process is summarized in Algorithm 1. 
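
The alternating filtering loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the hypothetical helper fit_and_score replaces "fine-tune the CNN, then apply it" with a nearest-centroid classifier on precomputed feature vectors, a fixed iteration count replaces the validation-based stopping criteria, and every class is assumed to keep at least one item after filtering. The names keep_confident and domain_transfer are likewise illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_and_score(train_x, train_y, test_x, num_classes):
    """Hypothetical stand-in for fine-tuning a CNN on the source domain and
    applying it to the target domain: a nearest-centroid classifier whose
    softmax over negative squared distances plays the role of the CNN output.
    Assumes every class still has at least one training item."""
    centroids = np.stack([train_x[train_y == a].mean(axis=0)
                          for a in range(num_classes)])
    d2 = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return softmax(-d2)

def keep_confident(scores, labels, theta):
    """Filtering rule of Eq. (1)/(2): keep an item only if the score for its
    own (noisy or weak) label exceeds the threshold theta."""
    return scores[np.arange(len(labels)), labels] > theta

def domain_transfer(img_x, img_y, frm_x, frm_y, num_classes,
                    theta1=0.5, theta2=0.5, max_iters=3):
    img_keep = np.ones(len(img_y), dtype=bool)
    frm_keep = np.ones(len(frm_y), dtype=bool)
    for _ in range(max_iters):  # assumption: fixed budget instead of validation check
        # Steps 1-3: train on the current frames, filter the images.
        s = fit_and_score(frm_x[frm_keep], frm_y[frm_keep], img_x, num_classes)
        img_keep &= keep_confident(s, img_y, theta1)
        # Steps 4-6: train on the current images, filter the frames.
        s = fit_and_score(img_x[img_keep], img_y[img_keep], frm_x, num_classes)
        frm_keep &= keep_confident(s, frm_y, theta2)
    # Eq. (3): LAF score of each frame under the image-trained model.
    s = fit_and_score(img_x[img_keep], img_y[img_keep], frm_x, num_classes)
    laf = s[np.arange(len(frm_y)), frm_y]
    return img_keep, frm_keep, laf
```

With real data, img_x and frm_x would hold the shared CNN features of web images and sampled frames, and the returned laf array would hold one score per training frame.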

Discussion: We initialize the frame set V by random 
sampling. Even though many of the sampled frames do not 
correspond to the actions of interest, they can help filter 
out the non-video like web images such as cartoons, object 
photos and logos. In practice, the random sampling of video 
frames is adequate for this step since the mis-labeled frames 
rarely appear in the web image collection. 

We set the stopping criteria to be: (1) video-level classification 
accuracy on a validation set starts to drop; or (2) 
a maximum number of iterations is reached. To be more 
efficient, we train one-vs-rest linear SVMs using frames in V 
after each iteration, and apply the classifiers to video frames 
in the validation set. We take the average of frame-level classifier 
responses to generate video-level responses, and use 
them to compute classification accuracy. 

3.3 Long Short-Term Memory Network 

Long Short-term Memory (LSTM) is a type of recurrent 
neural network (RNN) that solves the vanishing and 
exploding gradients problem of previous RNN architectures 
when trained using back-propagation. The standard LSTM 
architecture includes an input layer, a recurrent LSTM layer 
and an output layer. The recurrent LSTM layer has a set 
of memory cells, which are used to store real-valued state 
information from previous observations. This recurrent information 
flow, from previous observations, is particularly 
useful for capturing temporal evolution in videos, which we 
hypothesize is useful in distinguishing between fine-grained 
sports activities. In addition, LSTM’s memory cells are protected 
by input gates and forget gates, which allow it to 
maintain a long-term memory and reset its memory, respectively. 
We employ the modification to LSTMs proposed by 
Sak et al. to add a projection layer after the LSTM layer. 
This reduces the dimension of stored states in memory cells, 
and helps to make the training process faster. 

Let us denote the input sequence $X$ as $\{x_1, x_2, \ldots, x_T\}$, 
where in our case each $x_t$ is a feature vector of a video frame 
with time stamp $t$. LSTM maps the input sequence into the 
output action responses $Y = \{y_1, y_2, \ldots, y_T\}$ by: 

$$i_t = \sigma(W_{ix} x_t + W_{ir} r_{t-1} + W_{ic} c_{t-1} + b_i) \tag{4}$$

$$f_t = \sigma(W_{fx} x_t + W_{fr} r_{t-1} + W_{fc} c_{t-1} + b_f) \tag{5}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c) \tag{6}$$

$$o_t = \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o) \tag{7}$$

$$m_t = o_t \odot h(c_t) \tag{8}$$

$$r_t = W_{rm} m_t \tag{9}$$

$$y_t = W_{yr} r_t + b_y \tag{10}$$


Here the $W$’s and $b$’s are the weight matrices and biases, 
respectively, and $\odot$ denotes the element-wise multiplication 
operation. $c$ is the memory cell activation; $i$, $f$, $o$ are the input 
gate, forget gate and output gate respectively. $m$ and $r$ are the 
recurrent activations before and after projection. $\sigma$ is the 
sigmoid function; $g$ and $h$ are tanh. An illustration of the 
LSTM architecture with a single memory block is shown in 
Figure 4. 
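
To make the recurrence concrete, the following NumPy sketch implements one forward pass of Equations (4) to (10). It is an illustrative reading of the equations rather than the training code used in our experiments: the parameter dictionary, the function names init_params, lstmp_step and lstmp_forward, the random initialization and the use of full (not diagonal) peephole matrices are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_cell, n_proj, n_out, seed=0):
    """Randomly initialised weights; shapes follow Eq. (4)-(10)."""
    rng = np.random.default_rng(seed)
    mat = lambda rows, cols: rng.normal(0.0, 0.1, size=(rows, cols))
    p = {}
    for gate in "ifo":
        p["W_%sx" % gate] = mat(n_cell, n_in)    # input weights
        p["W_%sr" % gate] = mat(n_cell, n_proj)  # recurrent (projected) weights
        p["W_%sc" % gate] = mat(n_cell, n_cell)  # peephole weights (full matrix here)
        p["b_%s" % gate] = np.zeros(n_cell)
    p["W_cx"], p["W_cr"], p["b_c"] = mat(n_cell, n_in), mat(n_cell, n_proj), np.zeros(n_cell)
    p["W_rm"] = mat(n_proj, n_cell)              # projection layer
    p["W_yr"], p["b_y"] = mat(n_out, n_proj), np.zeros(n_out)
    return p

def lstmp_step(p, x_t, r_prev, c_prev):
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ r_prev + p["W_ic"] @ c_prev + p["b_i"])  # (4)
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ r_prev + p["W_fc"] @ c_prev + p["b_f"])  # (5)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ r_prev + p["b_c"])  # (6)
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ r_prev + p["W_oc"] @ c_t + p["b_o"])     # (7)
    m_t = o_t * np.tanh(c_t)                                                             # (8)
    r_t = p["W_rm"] @ m_t                                                                # (9)
    y_t = p["W_yr"] @ r_t + p["b_y"]                                                     # (10)
    return y_t, r_t, c_t

def lstmp_forward(p, xs):
    """Run the recurrence over a (T, n_in) sequence, starting from zero state."""
    n_proj, n_cell = p["W_rm"].shape
    r, c = np.zeros(n_proj), np.zeros(n_cell)
    ys = []
    for x_t in xs:
        y_t, r, c = lstmp_step(p, x_t, r, c)
        ys.append(y_t)
    return np.stack(ys)
```

Each row of the returned array is the pre-softmax response $y_t$ for one time step; in our setting $x_t$ would be the 4,096-dimensional CNN feature of a sampled frame.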


Figure 4: Illustration of LSTM architecture with a 
single memory block. A recurrent projection layer 
is added to reduce the number of parameters. Reproduced 
from [22] with the authors’ permission. 

Training LSTM with LAF scores. We sample video 
frames at 1 frame per second and treat each frame as a basic 
LSTM step. Similar to speech recognition tasks, each time 
step requires a label and a penalty weight for misclassification. 
The truncated backpropagation through time (BPTT) 
learning algorithm is used for training. We limit the 
maximum unrolling time steps to k and only back-propagate 
the error for k time steps. Incorporating the LAF scores into 
the LSTM framework is simple: we first run the LAF proposal 
pipeline to score all sampled training video frames. 
Then we set the frame-level labels based on video-level annotation, 
but use the LAF scores as the penalty weights. 
Using this method, LSTM is forced to make the correct 
decision after watching a LAF returned by the LAF proposal 
system, and it is not penalized as heavily when gathering 
context information from earlier frames or misclassifying an 
unrelated frame. 
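
One way to read the weighting scheme is as a per-frame cross-entropy loss scaled by the LAF score. The sketch below assumes exactly that form; the function name laf_weighted_loss and the choice to sum (rather than average) over time steps are illustrative assumptions, not details stated in the text.

```python
import numpy as np

def laf_weighted_loss(logits, video_label, laf_scores):
    """Cross-entropy over T time steps, each weighted by its LAF score.

    logits:      (T, N) pre-softmax outputs y_t of the LSTM
    video_label: int, the video-level label assigned to every frame
    laf_scores:  (T,) penalty weights in [0, 1]
    """
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=1, keepdims=True)               # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    per_frame = -log_probs[:, video_label]                        # unweighted penalty
    return float((np.asarray(laf_scores, dtype=float) * per_frame).sum())
```

A frame with LAF score close to 0 contributes almost nothing to the loss, so the network is free to use it as context, while a frame with score close to 1 is penalized as in ordinary supervised training.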

Computing LAF scores for video shots. For some 
data sets, it might be desirable to use video shots as the 
basic LSTM steps, as it allows the use of spatio-temporal 
motion features for representation. We extend the frame-level 
LAF scores to shot-level by taking the average of LAF 
scores from the sampled frames within a certain video shot. 
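
The shot-level averaging can be written compactly as follows. The helper name shot_laf_scores and the convention that shots are identified by integer ids 0, 1, 2, ... are assumptions made for this sketch.

```python
import numpy as np

def shot_laf_scores(frame_scores, shot_ids):
    """Average frame-level LAF scores within each shot.

    frame_scores: (T,) LAF score of every sampled frame
    shot_ids:     (T,) non-negative integer shot index of every frame
    Returns one score per shot id; ids with no frames get a score of 0.
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    shot_ids = np.asarray(shot_ids)
    sums = np.bincount(shot_ids, weights=frame_scores)
    counts = np.bincount(shot_ids)
    return sums / np.maximum(counts, 1)
```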

4. EXPERIMENTS 

This section first describes the data set we collected for 
evaluation, and then presents experimental results. 

4.1 Data Set 

There is no existing data set for fine-grained action localization 
using untrimmed web videos. To evaluate our proposed 
method’s performance, we collected a Fine Grained 
Actions 240 (FGA-240) data set focusing on sports videos. It 
consists of over 130,000 YouTube videos in 240 categories. A 
subset of the categories is shown in Figure 5. We selected 85 
high-level sports activities from the Sports-1M data set [11], 
and manually chose the fine-grained actions that take place in 
these activities. The action categories cover aquatic sports, 
team sports, sports with animals and others. 

We decided the fine-grained categories for each high-level 
sports activity using the following method: given YouTube 
videos and their associated text data such as titles and descriptions, 
we ran an automatic text parser to recognize 
sports related entities. The recognized entities which correlate 
with the high-level sports activities were stored in a 
pool and then manually filtered to keep only fine-grained 
sports actions. As an example, for basketball the initial entity 
pool contains not only fine-grained sports actions (e.g., 
slam dunk, block), but also game events (e.g., NBA) and 
celebrities (e.g., Kobe Bryant). Once the fine-grained categories 
were fixed, we applied the same text analyzer to automatically 
assign video-level annotations, and only kept the 
videos with high annotation confidence. We finally visualized 
the data set to filter out false annotations and removed 
the fine-grained sports action categories with too few samples. 

Figure 5: Some of the high-level sports activities 
and their corresponding fine-grained sports actions 
in Fine Grained Actions 240 data set. Best viewed 
under magnification. 

Our final data set contains 48,381 training videos and 
87,454 evaluation videos. The median number of training 
videos per category is 133. We used 20% of the evaluation 
videos for validation and 80% for testing. 

For temporal localization evaluation, we manually anno¬ 
tated 400 videos from 45 fine-grained actions. The average 
length of the videos is 79 seconds. 

4.2 Experiment Setup 

LSTM implementation. We used the feature activations 
from pre-trained AlexNet (first fully-connected layer 
with 4,096 dimensions) as the input features for each time 
step. We followed the LSTM implementation by Sak et 
al. which utilizes a multi-core CPU on a single machine. 
For training, we used asynchronous stochastic gradient descent 
and set batch size to 12. We tuned the training parameters 
on the validation videos and set the number of LSTM 
cells to 1024, learning rate to 0.0024 and learning rate decay 
with a factor of 0.1. We fixed the maximum unroll time step 
k to 20 to forward-propagate the activations and 
backward-propagate the errors. 


Video level classification. We evaluated fine-grained 
action classification results on video level. We sampled test 
video frames at 1 frame per second. Given T sampled frames 
from a video, these frames are forward-propagated through 
time, and produce T softmax activations. We used average 
fusion to aggregate the frame-level activations over whole 
videos. 

Temporal localization. We generated the frame-level 
softmax activations using the same approach as video level 
classification. We used a temporal sliding window of 10 time 
steps; the score of each sliding window was decided by taking 
the average of softmax activations. We then applied 
non-maximum suppression to remove the localized windows 
which overlap with each other. 
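
The sliding-window scoring and non-maximum suppression can be sketched as below for a single action class. This is a minimal illustration under assumptions: windows are half-open intervals [start, start + window) in time-step units, suppression is greedy, and the default threshold of 0 removes any window that overlaps a higher-scoring one. The names temporal_iou and localize are hypothetical.

```python
import numpy as np

def temporal_iou(a, b):
    """Intersection over union of two half-open intervals (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def localize(frame_scores, window=10, iou_threshold=0.0):
    """Score every window by its mean frame score, then apply greedy NMS.

    frame_scores: (T,) softmax activation of one action class per time step
    Returns a list of (start, end, score) tuples, best first.
    """
    scores = np.asarray(frame_scores, dtype=float)
    if len(scores) < window:
        return []
    csum = np.concatenate([[0.0], np.cumsum(scores)])
    win_scores = (csum[window:] - csum[:-window]) / window   # mean of each window
    order = np.argsort(-win_scores, kind="stable")           # highest score first
    kept = []
    for start in order:
        cand = (int(start), int(start) + window)
        if all(temporal_iou(cand, k[:2]) <= iou_threshold for k in kept):
            kept.append((cand[0], cand[1], float(win_scores[start])))
    return kept
```

In practice one would also drop the retained windows whose score falls below a confidence threshold before evaluation.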

Evaluation metric. For classification, we report Hit @k, 
which is the percentage of testing videos whose labels can 
be found in the top k results. For localization, we follow 
the same evaluation protocol as THUMOS 2014 and 
evaluate mean average precision. A detection is considered 
to be a correct one if its overlap with groundtruth is over 
some ratio r. 
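
For concreteness, the average fusion step and the Hit @k metric can be computed as in the following sketch. The function names video_scores and hit_at_k are illustrative, and ties are broken by class index, which is an assumption of this example.

```python
import numpy as np

def video_scores(frame_softmax):
    """Average fusion: mean of the T frame-level softmax vectors of one video."""
    return np.asarray(frame_softmax, dtype=float).mean(axis=0)

def hit_at_k(score_matrix, labels, k):
    """Percentage of videos whose label is among the k highest-scoring classes.

    score_matrix: (num_videos, N) fused video-level scores
    labels:       (num_videos,) ground-truth class indices
    """
    scores = np.asarray(score_matrix, dtype=float)
    topk = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return 100.0 * hits.mean()
```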

CNN baseline. We deployed the single-frame architecture 
used by Karpathy et al. as the CNN baseline. It 
was shown to have comparable performance with multiple 
variations of CNNs while being simpler. We sampled the 
video frames at 1 frame per second, and used average fusion 
to aggregate softmax scores for classification and localization 
tasks. Instead of training a CNN from scratch, we used 
network parameters from the pre-trained AlexNet, and 
fine-tuned the top two fully-connected layers and a softmax layer. 
Training parameters were decided using the validation set. 

Low-level feature baseline. We extracted the low-level features 
used by [35] over whole videos for the classification 
task; the feature set includes low-level visual and motion 
features aggregated using bag-of-words. We used the same 
neural network architecture as prior work, with multiple Rectified 
Linear Units, to build classifiers based on the low-level features. 
Its structure (e.g., number of layers, number of cells 
per layer) as well as training parameters were decided with 
the validation set. 


4.3 Video-level Classification Results 

We first report the fine-grained action classification performance 
on video level. 

Comparison with baselines. We compared several 
baseline systems’ performance against our proposed method 
on the FGA-240 data set; the results are shown in Table 1. From 
the table we can see that systems based on CNN activations 
outperformed low-level features by a large margin. There 
are two possible reasons for this: first, CNN learned activations 
are more discriminative in classifying fine-grained 
sports actions, even without capturing local motion patterns 
explicitly; second, low-level features were aggregated 
on video-level. These video-level features are more sensitive 
to background and irrelevant video segments, which occur 
frequently in fine-grained sports action videos. 

Among the systems relying on CNN activations, applying 
LSTM gave better performance than fine-tuning the top 
layers of CNN. While both LSTM and CNN used the late 
fusion of frame-level softmax activations to generate video-level 
classification results, LSTM took previous observations 
into consideration with the help of memory cells. This shows 
that temporal information helps classify fine-grained sports 
actions, and it was captured by LSTM network. 

Method                     Video Hit @1    Video Hit @5 
Random                     0.4             2.1 
Low-level features [35]    30.8            - 
CNN [11]                   37.3            68.5 
LSTM w/o LAF               41.1            70.2 
LSTM w/ LAF                43.4            74.9 

Table 1: Video-level classification performance of 
several different systems on fine-grained actions. 

Figure 6: Example actions when LAF is not helpful. 
The web images retrieved to generate LAF proposals 
might be beautified (top) or taken from different 
viewpoint (middle). Sometimes there is a mix of 
the two issues (bottom). Panels, top to bottom: 
Parkour:Free Running; Paragliding:Towing; Freestyle BMX:Stunt. 

Finally, using LAF proposals helped further improve the 
video hit @1 by 2.3 percentage points and video hit @5 by 4.7 
percentage points. In Table 2 we show the difference in average 
precision for LSTM with and without LAF proposal. We observe that 
LAF proposal helps the most when the fine-grained sports 
actions are likely to be identified based on single frames, 
and the image highlights on the Internet are visually very 
similar to the videos. Note that there are still non-video-like 
and irrelevant images retrieved from the Internet for these 
categories, but the LAF proposal system is an effective filter. 
Figure gives the three systems’ output on a few example 
videos. 

We also identify several cases when LAF proposals failed 
to work. The most common case is when most of the retrieved 
images are non-video like but not filtered out. They 
could be posed images or beautified images with logos, such 
as images retrieved for Parkour:Free running, or have different 
viewpoints than videos, such as Paragliding:Towing. 
Sample video snapshots and web images are shown in Figure 6. 

Fine-grained sports      ΔAP      ΔAP      Fine-grained sports 
Fencing:Parry            0.17     -0.09    Parkour:Free Running 
Cricket:Run out          0.15     -0.08    Freestyle soccer:Crip Walk 
CrossFit:Deadlift        0.10     -0.08    Paragliding:Towing 
CrossFit:Handstand       0.09     -0.07    Freestyle BMX:Stunt 
Calisthenics:Push-up     0.09     -0.07    Judo:Sweep 
Rings:Pull-up            0.08     -0.06    Basketball:Point 

Table 2: Difference in average precision between 
LSTM with and without LAF proposal. Sorted by 
top wins (left) and top losses (right). 

Method           Video Hit @1    Video Hit @5 
Random           1.2             5.9 
CNN [11]         69.2            75.9 
LSTM w/o LAF     71.7            77.3 
LSTM w/ LAF      73.6            79.5 

Table 3: Classification performance when measured 
on high-level sports activities (e.g., basketball, soccer). 

Impact of action hierarchy. A fine-grained sports action 
could be misclassified as either its sibling or non-sibling 
leaf nodes in the sports hierarchy. For example, a basketball 
slam dunk can be confused with basketball alley-oop as well 
as street ball slam dunk. To study the source of confusion, 
we measured the classification accuracy of high-level 
sports activities, and checked how the numbers compared with 
fine-grained sports actions. 

We obtain the confidence values for high-level sports 
activities by taking the average of their child nodes’ confidence 
scores. Table 3 shows the classification accuracy with 
different methods. We can see that the overall trend is the same 
as for fine-grained sports actions: LSTM with LAF proposal 
is still the best. However, the numbers are much higher 
than when measured at the fine-grained level, which indicates 
that the major source of confusion still comes from the 
fine-grained level. In Figure 7 we provide the zoomed-in confusion 
matrices for ice hockey, CrossFit and basketball. 
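
The averaging step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes labels follow the "Activity:Action" naming used in the tables, and the helper names `activity_scores` and `hit_at_k`, along with the example scores, are hypothetical.

```python
from collections import defaultdict


def activity_scores(action_scores):
    """Average fine-grained 'Activity:Action' scores into per-activity scores."""
    grouped = defaultdict(list)
    for label, score in action_scores.items():
        # Assumption: the parent activity is the text before the first colon.
        activity = label.split(":", 1)[0]
        grouped[activity].append(score)
    return {activity: sum(v) / len(v) for activity, v in grouped.items()}


def hit_at_k(scores, true_label, k):
    """Return True if true_label is among the k highest-scoring labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return true_label in ranked[:k]


# Made-up confidence values for a single video, for illustration only.
fine = {
    "Basketball:Slam dunk": 0.50,
    "Basketball:Alley-oop": 0.30,
    "Streetball:Slam dunk": 0.60,
    "Soccer:Penalty kick": 0.10,
}
coarse = activity_scores(fine)
```

In this toy example the Basketball score is the mean of its two child scores, so a confusion between sibling actions such as slam dunk and alley-oop no longer counts as an error at the activity level, while a confusion with Streetball still does.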

4.4 Localization Results 

Comparison with baselines. We applied the 
frameworks to localize fine-grained actions, and varied the 
overlap ratio r from 0.1 to 0.5 for evaluation. Figure 9 shows 
the mean average precision over all 45 categories for the 
different systems. We did not include the baseline using low-level 
features in this evaluation, as they were computed over whole 
videos. From the figure we can see that LSTM with LAF 
proposal outperformed both CNN and LSTM without LAF 
proposal significantly, and that the gap grows wider as we 
increase the overlap ratio. This confirms that temporal 
information and LAF proposal are helpful for the temporal 
localization task. 
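
To make the role of the overlap ratio r concrete, the sketch below checks whether a predicted segment counts as a correct detection. It assumes that the overlap is measured as temporal intersection over union and that a detection is accepted when the overlap is at least r; this section does not spell out either detail, so both are assumptions, and the function names are hypothetical.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_correct(detection, ground_truths, r):
    """A detection is correct if it overlaps any ground-truth interval by >= r."""
    return any(temporal_iou(detection, gt) >= r for gt in ground_truths)


# Illustrative intervals: 5 s of intersection over a 15 s union, IoU = 1/3.
detection = (10.0, 20.0)
ground_truth = [(15.0, 25.0)]
results = [is_correct(detection, ground_truth, r) for r in (0.1, 0.2, 0.3, 0.4, 0.5)]
```

The same detection is accepted at r = 0.1 to 0.3 and rejected at r = 0.4 and 0.5, which illustrates why scores tend to drop as the overlap ratio increases.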

In Table 4 we show the largest differences in average precision 
at the action level. Some actions have clearly benefited from 
the introduction of LSTM as well as LAF proposal. We also 
observed that some actions were completely missed by all 
three systems, such as Baseball:Hit, Basketball:Three-point 
field goal and Basketball:Block, possibly due to the video 
































Figure 7: Magnified view of confusion matrices for ice hockey, cross fit and basketball 



Baseball:Throwing          Soccer:Direct freekick            Skateboarding:Nollie      CrossFit:Squat 
Tennis:Forehand            American football:Kickoff         Longboarding:Slide        CrossFit:Sit-up 
Tennis:Serve               Rugby union:Try                   BMX:Flatland BMX          CrossFit:Snatch 

Baseball:Throwing          American football:Touchdown       Basketball:Jump shot      Gymnastics:Trampoline 
Backpacking:Camping        Soccer:Penalty kick               Basketball:Dribbling      Calisthenics:Push-up 
Knife throwing:Throwing    American football:Interception    Parkour:Free Running      Calisthenics:Pull-up 

BMX:Tailwhip               Acrobatics:Acro dance             Baseball:Stolen base      Basketball:Dribbling 
BMX:Tailwhip               Jujutsu:Armlock                   Baseball:Hit              Basketball:Jump shot 
Skateboarding:Kickflip     Basketball:Free throw             Softball:Bunt             Baseball:Throwing 


Figure 8: Classification output for a few videos. The labels under each video were generated by LSTM with 
LAF proposal, LSTM without LAF proposal and CNN from top to bottom. Correct answers are marked in 
bold. 




















overlap ratio 


Figure 9: Temporal localization performance on 
FGA-240 data set. 


Fine-grained sports      Δ AP     Δ AP    Fine-grained sports 
Soccer:Penalty kick      0.32    -0.06    Baseball:Run 
Tennis:Serve             0.25    -0.01    Skateboarding:Kickflip 
Basketball:Dribbling     0.21    -0.01    Volleyball:Spiking 

Fine-grained sports      Δ AP     Δ AP    Fine-grained sports 
Baseball:Brawl           0.52    -0.11    Streetball:Crossovers 
Ice hockey:Combat        0.48    -0.05    Ice hockey:Penalty shot 
Soccer:Penalty kick      0.33    -0.04    Fencing:Parry 


Table 4: Difference in average precision, compared 
between LSTM with and without LAF proposal 
(top), and between LSTM with LAF proposal and CNN 
(bottom). The overlap ratio is fixed to 0.5. A positive number 
means LSTM with LAF proposal is better. 


frames corresponding to these actions not being well 
localized during training. 

4.5 Localization Results on THUMOS 2014 

To verify the effectiveness of domain transfer from web 
images, we also conducted a localization experiment on the 
THUMOS 2014 data set [^. This data set consists of over 
13,000 temporally trimmed videos from 101 actions, 1,010 
temporally untrimmed videos for validation and 2,574 
temporally untrimmed videos for testing. The localization 
annotations cover 20 out of the 101 actions in the validation 
and test sets. All 20 actions are sports related. 

Experiment setup: As this paper focuses on temporal 
localization of untrimmed videos, we dropped the 13,000 
trimmed videos, and used the untrimmed validation videos 
as the only positive samples for training. We also used 2,500 
background videos as the shared negative training data. 

To generate LAF scores, we downloaded web images from 
Flickr and Google using the action names as queries. We also 
sampled training video frames at 1 frame per second. We 
used AlexNet features for the domain transfer experiment. 

Recently, it has been shown that a combination of 
improved dense trajectory features and Fisher vector 
encoding (iDT+FV) offers the state-of-the-art performance 



                           Overlap ratio 
Method            0.1      0.2      0.3      0.4      0.5 
Ground truth      0.161    0.152    0.112    0.071    0.044 
Video [27, 19]    0.098    0.089    0.071    0.041    0.024 
LSTM w/o LAF      0.076    0.071    0.057    0.038    0.024 
LSTM w/ LAF       0.124    0.110    0.085    0.052    0.044 


Table 5: Temporal localization on the test partition 
of THUMOS 2014 dataset. Ground truth uses 
temporal annotation of the training videos. 

on this data set. This motivated us to switch LSTM time 
steps from frames to video segments, and to represent segments 
with iDT+FV features for the final detector training. We 
segmented all videos uniformly with a window width of 100 
frames and a step size of 50 frames. For iDT+FV feature 
extraction, we took only the MBH modality with 192 
dimensions and reduced the dimensions to 96 with PCA. We 
used the full Fisher vector formulation with the number of 
GMM cluster centers set to 128. The final video segment 
representation has 24,576 dimensions. 
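
The segmentation and the reported feature size can be checked with a short sketch. It assumes that only full-width windows are kept (how trailing frames are handled is not stated here) and that the Fisher vector stacks gradients with respect to the GMM means and variances, giving 2 x D x K dimensions, which matches the stated 24,576. The function names are hypothetical.

```python
def segment_windows(num_frames, width=100, step=50):
    """Return (start, end) frame ranges, end exclusive, for full windows only."""
    # Assumption: a trailing partial window is dropped.
    return [(s, s + width) for s in range(0, num_frames - width + 1, step)]


def fisher_vector_dim(descriptor_dim, num_gaussians):
    """Size of a Fisher vector built from mean and variance gradients."""
    return 2 * descriptor_dim * num_gaussians


windows = segment_windows(250)       # a 250-frame video, for illustration
fv_dim = fisher_vector_dim(96, 128)  # PCA-reduced MBH descriptors, 128 Gaussians
```

With a step of half the window width, consecutive segments overlap by 50 frames, so every frame away from the video boundaries is covered by two LSTM time steps.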

Results: We compared the performance of LSTM weighted 
by LAF scores against several baselines. LSTM w/o LAF 
randomly assigned a misclassification penalty to each LSTM 
step: 30% of the steps were set to 1, and the others to 0. 
The Video baseline used iDT+FV features aggregated over 
whole videos to train linear SVM classifiers, and applied the 
classifiers to the testing video shots. It was used by [27, 19] 
and achieved state-of-the-art performance in event 
recounting and video summarization tasks. None of these 
systems require temporal annotations. Finally, Ground truth 
employed manually annotated temporal localizations to set 
the LSTM penalty weights. It was used to study the performance 
difference between LAF and an oracle with perfectly localized 
actions. 
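
The three ways of setting per-step penalties can be contrasted with a small sketch. This is an illustration under assumptions, not the training code: it assumes that exactly 30% of the steps are switched on in the random baseline (rather than each step independently with probability 0.3) and that the penalty simply scales a per-step loss. All function names are hypothetical.

```python
import random


def random_penalties(num_steps, fraction=0.3, seed=0):
    """LSTM w/o LAF baseline: a random subset of steps gets penalty 1, the rest 0."""
    rng = random.Random(seed)
    num_on = round(fraction * num_steps)
    on = set(rng.sample(range(num_steps), num_on))
    return [1.0 if t in on else 0.0 for t in range(num_steps)]


def ground_truth_penalties(num_steps, intervals):
    """Oracle: penalty 1 inside annotated (start, end) step ranges, end exclusive."""
    return [
        1.0 if any(s <= t < e for s, e in intervals) else 0.0
        for t in range(num_steps)
    ]


def weighted_loss(step_losses, penalties):
    """Assumed form: sum of per-step losses scaled by their penalty weights."""
    return sum(w * l for w, l in zip(penalties, step_losses))


baseline = random_penalties(10)
oracle = ground_truth_penalties(5, [(1, 3)])
loss = weighted_loss([1.0, 2.0, 3.0], [0.0, 1.0, 1.0])
```

In the LAF setting, the binary weights would be replaced by the LAF scores of the corresponding steps, so steps that look like the action contribute more to the loss.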

Table 5 shows the mean average precision for the four 
approaches. As expected, using manually annotated ground 
truth for training provides the best localization performance. 
Although LSTM with LAF scores performs worse 
than using ground truth, it outperforms LSTM without LAF 
scores and the video-level baseline by large margins. This 
further confirms that LAF proposal by domain transfer from 
web images is effective in action localization tasks. 

5. CONCLUSION 

We studied the problem of fine-grained action localization 
for temporally untrimmed web videos. We proposed to use 
noisily tagged web images to discover localized action frames 
(LAF) from videos, and model temporal information with 
LSTM networks. We conducted thorough evaluations on 
our collected FGA-240 data set and the public THUMOS 
2014 data set, and showed the effectiveness of LAF proposal 
by domain transfer from web images. 